Overview

Dataset statistics

Number of variables: 23
Number of observations: 434605
Missing cells: 1340989
Missing cells (%): 13.4%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 76.3 MiB
Average record size in memory: 184.0 B

Variable types

CAT: 14
NUM: 9

Reproduction

Analysis started: 2020-07-09 22:47:48.641904
Analysis finished: 2020-07-09 22:50:53.880440
Duration: 3 minutes and 5.24 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

region has a high cardinality: 403 distinct values [High cardinality]
model has a high cardinality: 33313 distinct values [High cardinality]
vin has a high cardinality: 141403 distinct values [High cardinality]
state has a high cardinality: 51 distinct values [High cardinality]
price_scaled is highly correlated with price [High correlation]
price is highly correlated with price_scaled [High correlation]
manufacturer has 20442 (4.7%) missing values [Missing]
model has 6019 (1.4%) missing values [Missing]
condition has 186345 (42.9%) missing values [Missing]
cylinders has 165921 (38.2%) missing values [Missing]
odometer has 74292 (17.1%) missing values [Missing]
vin has 195579 (45.0%) missing values [Missing]
drive has 121473 (28.0%) missing values [Missing]
size has 295175 (67.9%) missing values [Missing]
type has 116543 (26.8%) missing values [Missing]
paint_color has 134689 (31.0%) missing values [Missing]
lat has 8227 (1.9%) missing values [Missing]
long has 8227 (1.9%) missing values [Missing]
price is highly skewed (γ1 = 158.2481522) [Skewed]
odometer is highly skewed (γ1 = 40.76855782) [Skewed]
price_scaled is highly skewed (γ1 = 158.2481522) [Skewed]
df_index has unique values [Unique]
id has unique values [Unique]
price has 30640 (7.1%) zeros [Zeros]
price_scaled has 30640 (7.1%) zeros [Zeros]

Variables

df_index
Real number (ℝ≥0)

UNIQUE

Distinct count434605
Unique (%)100.0%
Missing0
Missing (%)0.0%
Infinite0
Infinite (%)0.0%
Mean217933.09336754063
Minimum0
Maximum435848
Zeros1
Zeros (%)< 0.1%
Memory size3.3 MiB

Quantile statistics

Minimum0
5-th percentile21778.2
Q1108955
median217937
Q3326902
95-th percentile414069.8
Maximum435848
Range435848
Interquartile range (IQR)217947

Descriptive statistics

Standard deviation125826.7724
Coefficient of variation (CV)0.5773642289
Kurtosis-1.200088501
Mean217933.0934
Median Absolute Deviation (MAD)108974
Skewness-0.000172233651
Sum9.471481204e+10
Variance1.583237665e+10
Histogram with fixed size bins (bins=10)
ValueCountFrequency (%) 
20471< 0.1%
 
4271301< 0.1%
 
4353261< 0.1%
 
871681< 0.1%
 
892171< 0.1%
 
830741< 0.1%
 
851231< 0.1%
 
953641< 0.1%
 
974131< 0.1%
 
912701< 0.1%
 
Other values (434595)434595> 99.9%
 
ValueCountFrequency (%) 
01< 0.1%
 
11< 0.1%
 
21< 0.1%
 
31< 0.1%
 
41< 0.1%
 
ValueCountFrequency (%) 
4358481< 0.1%
 
4358471< 0.1%
 
4358461< 0.1%
 
4358451< 0.1%
 
4358441< 0.1%
 

id
Real number (ℝ≥0)

UNIQUE

Distinct count434605
Unique (%)100.0%
Missing0
Missing (%)0.0%
Infinite0
Infinite (%)0.0%
Mean7115953655.941774
Minimum7096577274
Maximum7121608239
Zeros0
Zeros (%)0.0%
Memory size3.3 MiB

Quantile statistics

Minimum7096577274
5-th percentile7107232076
Q17112448897
median7117092599
Q37120091568
95-th percentile7121302169
Maximum7121608239
Range25030965
Interquartile range (IQR)7642671

Descriptive statistics

Standard deviation4591549.938
Coefficient of variation (CV)0.0006452473077
Kurtosis-0.8476639242
Mean7115953656
Median Absolute Deviation (MAD)3479851
Skewness-0.6056791629
Sum3.092629039e+15
Variance2.108233083e+13
Histogram with fixed size bins (bins=10)
ValueCountFrequency (%) 
71198330871< 0.1%
 
71204627451< 0.1%
 
71170138671< 0.1%
 
71178536161< 0.1%
 
71172207761< 0.1%
 
71175800401< 0.1%
 
71214590711< 0.1%
 
71131597711< 0.1%
 
71161660631< 0.1%
 
71099299041< 0.1%
 
Other values (434595)434595> 99.9%
 
ValueCountFrequency (%) 
70965772741< 0.1%
 
71042708321< 0.1%
 
71042717881< 0.1%
 
71042725291< 0.1%
 
71055984101< 0.1%
 
ValueCountFrequency (%) 
71216082391< 0.1%
 
71216078731< 0.1%
 
71216077871< 0.1%
 
71216077061< 0.1%
 
71216073681< 0.1%
 

region
Categorical

HIGH CARDINALITY

Distinct count403
Unique (%)0.1%
Missing0
Missing (%)0.0%
Memory size3.3 MiB
ValueCountFrequency (%) 
springfield35880.8%
 
jacksonville34570.8%
 
columbus32800.8%
 
fayetteville31350.7%
 
richmond30420.7%
 
salem29890.7%
 
portland29830.7%
 
des moines29800.7%
 
boise29790.7%
 
fresno / madera29790.7%
 
Other values (393)40319392.8%
 

Length

Max length26
Median length11
Mean length11.4778339
Min length4

price
Real number (ℝ≥0)

HIGH CORRELATION
SKEWED
ZEROS

Distinct count16735
Unique (%)3.9%
Missing0
Missing (%)0.0%
Infinite0
Infinite (%)0.0%
Mean135241.1402146777
Minimum0
Maximum3647256576
Zeros30640
Zeros (%)7.1%
Memory size3.3 MiB

Quantile statistics

Minimum0
5-th percentile0
Q14900
median9995
Q317988
95-th percentile34590
Maximum3647256576
Range3647256576
Interquartile range (IQR)13088

Descriptive statistics

Standard deviation16932750.66
Coefficient of variation (CV)125.2041401
Kurtosis26745.48215
Mean135241.1402
Median Absolute Deviation (MAD)6000
Skewness158.2481522
Sum5.877647574e+10
Variance2.867180451e+14
Histogram with fixed size bins (bins=10)
ValueCountFrequency (%) 
0306407.1%
 
699539630.9%
 
799538770.9%
 
450038020.9%
 
599537480.9%
 
350037160.9%
 
899536670.8%
 
550034820.8%
 
650033940.8%
 
999533680.8%
 
Other values (16725)37094885.4%
 
ValueCountFrequency (%) 
0306407.1%
 
117880.4%
 
216< 0.1%
 
325< 0.1%
 
417< 0.1%
 
ValueCountFrequency (%) 
36472565761< 0.1%
 
33333333331< 0.1%
 
32685622611< 0.1%
 
29895429683< 0.1%
 
25251414681< 0.1%
 

year
Real number (ℝ≥0)

Distinct count72
Unique (%)< 0.1%
Missing1117
Missing (%)0.3%
Infinite0
Infinite (%)0.0%
Mean2010.07815210571
Minimum1950.0
Maximum2021.0
Zeros0
Zeros (%)0.0%
Memory size3.3 MiB

Quantile statistics

Minimum1950
5-th percentile1998
Q12007
median2012
Q32015
95-th percentile2018
Maximum2021
Range71
Interquartile range (IQR)8

Descriptive statistics

Standard deviation8.422884279
Coefficient of variation (CV)0.004190326764
Kurtosis12.2923228
Mean2010.078152
Median Absolute Deviation (MAD)4
Skewness-2.860329603
Sum871344758
Variance70.94497958
Histogram with fixed size bins (bins=10)
ValueCountFrequency (%) 
2017345928.0%
 
2015329187.6%
 
2016320967.4%
 
2014317037.3%
 
2013314347.2%
 
2012291086.7%
 
2011265326.1%
 
2008226435.2%
 
2007204574.7%
 
2018201474.6%
 
Other values (62)15185834.9%
 
ValueCountFrequency (%) 
1950120< 0.1%
 
1951113< 0.1%
 
195288< 0.1%
 
195393< 0.1%
 
195494< 0.1%
 
ValueCountFrequency (%) 
2021119< 0.1%
 
202028200.6%
 
2019155313.6%
 
2018201474.6%
 
2017345928.0%
 

manufacturer
Categorical

MISSING

Distinct count42
Unique (%)< 0.1%
Missing20442
Missing (%)4.7%
Memory size3.3 MiB
ValueCountFrequency (%) 
ford7772117.9%
 
chevrolet6219314.3%
 
toyota343327.9%
 
nissan231355.3%
 
honda224085.2%
 
ram203004.7%
 
jeep196854.5%
 
gmc186444.3%
 
dodge143703.3%
 
bmw125162.9%
 
Other values (32)10885925.0%
 
(Missing)204424.7%
 

Length

Max length15
Median length5
Mean length5.639065358
Min length3

model
Categorical

HIGH CARDINALITY
MISSING

Distinct count33313
Unique (%)7.8%
Missing6019
Missing (%)1.4%
Memory size3.3 MiB
ValueCountFrequency (%) 
f-15085132.0%
 
silverado 150054571.3%
 
150046901.1%
 
silverado39620.9%
 
accord33030.8%
 
camry32720.8%
 
altima31400.7%
 
escape29150.7%
 
grand cherokee29050.7%
 
250028730.7%
 
Other values (33303)38755689.2%
 
(Missing)60191.4%
 

Length

Max length73
Median length8
Mean length10.54961172
Min length1

condition
Categorical

MISSING

Distinct count6
Unique (%)< 0.1%
Missing186345
Missing (%)42.9%
Memory size3.3 MiB
ValueCountFrequency (%) 
excellent11811827.2%
 
good9369321.6%
 
like new275356.3%
 
fair68951.6%
 
new13350.3%
 
salvage6840.2%
 
(Missing)18634542.9%
 

Length

Max length9
Median length4
Mean length5.185218762
Min length3

cylinders
Categorical

MISSING

Distinct count8
Unique (%)< 0.1%
Missing165921
Missing (%)38.2%
Memory size3.3 MiB
ValueCountFrequency (%) 
6 cylinders9531221.9%
 
4 cylinders8588019.8%
 
8 cylinders8163318.8%
 
5 cylinders24190.6%
 
10 cylinders16010.4%
 
other11020.3%
 
3 cylinders5320.1%
 
12 cylinders205< 0.1%
 
(Missing)16592138.2%
 

Length

Max length12
Median length11
Mean length7.934747644
Min length3

fuel
Categorical

Distinct count5
Unique (%)< 0.1%
Missing2989
Missing (%)0.7%
Memory size3.3 MiB
ValueCountFrequency (%) 
gas37521286.3%
 
diesel378028.7%
 
other133123.1%
 
hybrid42661.0%
 
electric10240.2%
 
(Missing)29890.7%
 

Length

Max length8
Median length3
Mean length3.363428861
Min length3

odometer
Real number (ℝ≥0)

MISSING
SKEWED

Distinct count108780
Unique (%)30.2%
Missing74292
Missing (%)17.1%
Infinite0
Infinite (%)0.0%
Mean98949.30538448515
Minimum0.0
Maximum10000000.0
Zeros2099
Zeros (%)0.5%
Memory size3.3 MiB

Quantile statistics

Minimum0
5-th percentile11077
Q147454
median91274
Q3134795
95-th percentile202964
Maximum10000000
Range10000000
Interquartile range (IQR)87341

Descriptive statistics

Standard deviation109739.8699
Coefficient of variation (CV)1.109051443
Kurtosis3180.005139
Mean98949.30538
Median Absolute Deviation (MAD)43710
Skewness40.76855782
Sum3.565272107e+10
Variance1.204283905e+10
Histogram with fixed size bins (bins=10)
ValueCountFrequency (%) 
020990.5%
 
15000010060.2%
 
1300009790.2%
 
1400009310.2%
 
1200008930.2%
 
1600008110.2%
 
1700007840.2%
 
2000007420.2%
 
1800007320.2%
 
1250006570.2%
 
Other values (108770)35067980.7%
 
(Missing)7429217.1%
 
ValueCountFrequency (%) 
020990.5%
 
13510.1%
 
252< 0.1%
 
374< 0.1%
 
429< 0.1%
 
ValueCountFrequency (%) 
100000003< 0.1%
 
999999910< 0.1%
 
98555001< 0.1%
 
92084831< 0.1%
 
81487001< 0.1%
 

title_status
Categorical

Distinct count6
Unique (%)< 0.1%
Missing1805
Missing (%)0.4%
Memory size3.3 MiB
ValueCountFrequency (%) 
clean41180194.8%
 
rebuilt116312.7%
 
salvage55681.3%
 
lien28570.7%
 
missing6630.2%
 
parts only2800.1%
 
(Missing)18050.4%
 

Length

Max length10
Median length5
Mean length5.070539916
Min length3

transmission
Categorical

Distinct count3
Unique (%)< 0.1%
Missing2146
Missing (%)0.5%
Memory size3.3 MiB
ValueCountFrequency (%) 
automatic38657988.9%
 
manual284266.5%
 
other174544.0%
 
(Missing)21460.5%
 

Length

Max length9
Median length9
Mean length8.613511119
Min length3

vin
Categorical

HIGH CARDINALITY
MISSING

Distinct count141403
Unique (%)59.2%
Missing195579
Missing (%)45.0%
Memory size3.3 MiB
ValueCountFrequency (%) 
WA1LAAF78HD040006124< 0.1%
 
7777777777777777775< 0.1%
 
SALGS2KF6GA24535568< 0.1%
 
1XPBDP9X5HD36370964< 0.1%
 
1HSXLAPT67J41192763< 0.1%
 
1F66F5KY2G0A0851259< 0.1%
 
1R9R1BF28JC82801259< 0.1%
 
WDCYC7DF1EX22721056< 0.1%
 
JM1NDAM74H010602052< 0.1%
 
5B4KP42Y01332779352< 0.1%
 
Other values (141393)23835454.8%
 
(Missing)19557945.0%
 

Length

Max length18
Median length17
Mean length10.67266368
Min length1

drive
Categorical

MISSING

Distinct count3
Unique (%)< 0.1%
Missing121473
Missing (%)28.0%
Memory size3.3 MiB
ValueCountFrequency (%) 
4wd14268532.8%
 
fwd11111725.6%
 
rwd5933013.7%
 
(Missing)12147328.0%
 

Length

Max length3
Median length3
Mean length3
Min length3

size
Categorical

MISSING

Distinct count4
Unique (%)< 0.1%
Missing295175
Missing (%)67.9%
Memory size3.3 MiB
ValueCountFrequency (%) 
full-size7510917.3%
 
mid-size402279.3%
 
compact208744.8%
 
sub-compact32200.7%
 
(Missing)29517567.9%
 

Length

Max length11
Median length3
Mean length4.751118832
Min length3

type
Categorical

MISSING

Distinct count13
Unique (%)< 0.1%
Missing116543
Missing (%)26.8%
Memory size3.3 MiB
ValueCountFrequency (%) 
SUV8014618.4%
 
sedan7973318.3%
 
pickup408699.4%
 
truck394419.1%
 
coupe172384.0%
 
other128253.0%
 
hatchback123952.9%
 
van99622.3%
 
wagon98782.3%
 
convertible84982.0%
 
Other values (3)70771.6%
 
(Missing)11654326.8%
 

Length

Max length11
Median length5
Mean length4.415752235
Min length3

paint_color
Categorical

MISSING

Distinct count12
Unique (%)< 0.1%
Missing134689
Missing (%)31.0%
Memory size3.3 MiB
ValueCountFrequency (%) 
white8005218.4%
 
black5951213.7%
 
silver4466910.3%
 
blue303807.0%
 
grey303207.0%
 
red290536.7%
 
green75001.7%
 
custom71771.7%
 
brown65171.5%
 
yellow20390.5%
 
Other values (2)26970.6%
 
(Missing)13468931.0%
 

Length

Max length6
Median length4
Mean length4.237003716
Min length3

state
Categorical

HIGH CARDINALITY

Distinct count51
Unique (%)< 0.1%
Missing0
Missing (%)0.0%
Memory size3.3 MiB
ValueCountFrequency (%) 
ca4649710.7%
 
fl315607.3%
 
tx244965.6%
 
or180514.2%
 
nc170263.9%
 
ny168403.9%
 
oh162503.7%
 
mi147373.4%
 
wi140543.2%
 
tn125832.9%
 
Other values (41)22251151.2%
 

Length

Max length2
Median length2
Mean length2
Min length2

lat
Real number (ℝ)

MISSING

Distinct count49177
Unique (%)11.5%
Missing8227
Missing (%)1.9%
Infinite0
Infinite (%)0.0%
Mean38.4037173583534
Minimum-83.1971
Maximum79.6019
Zeros0
Zeros (%)0.0%
Memory size3.3 MiB

Quantile statistics

Minimum-83.1971
5-th percentile28.11497
Q134.2216
median38.933
Q342.4845
95-th percentile47.1991
Maximum79.6019
Range162.799
Interquartile range (IQR)8.2629

Descriptive statistics

Standard deviation6.038350579
Coefficient of variation (CV)0.1572334918
Kurtosis7.576887268
Mean38.40371736
Median Absolute Deviation (MAD)3.9186
Skewness-0.3766089785
Sum16374500.2
Variance36.46167771
Histogram with fixed size bins (bins=10)
ValueCountFrequency (%) 
33.779247771.1%
 
33.786538150.9%
 
40.468821470.5%
 
47.696120060.5%
 
47.798919850.5%
 
46.234819140.4%
 
47.656114370.3%
 
43.182414060.3%
 
41.134511550.3%
 
38.382611490.3%
 
Other values (49167)40458793.1%
 
(Missing)82271.9%
 
ValueCountFrequency (%) 
-83.19711< 0.1%
 
-75.76031< 0.1%
 
-70.76682< 0.1%
 
-67.07091< 0.1%
 
-63.00971< 0.1%
 
ValueCountFrequency (%) 
79.60191< 0.1%
 
78.47331< 0.1%
 
67.67021< 0.1%
 
67.00221< 0.1%
 
66.86391< 0.1%
 

long
Real number (ℝ)

MISSING

Distinct count48385
Unique (%)11.3%
Missing8227
Missing (%)1.9%
Infinite0
Infinite (%)0.0%
Mean-94.96009014806579
Minimum-177.012
Maximum173.675
Zeros0
Zeros (%)0.0%
Memory size3.3 MiB

Quantile statistics

Minimum-177.012
5-th percentile-122.608
Q1-111.717
median-89.6767
Q3-81.3976
95-th percentile-73.08
Maximum173.675
Range350.687
Interquartile range (IQR)30.3194

Descriptive statistics

Standard deviation18.05832256
Coefficient of variation (CV)-0.1901674959
Kurtosis1.181555964
Mean-94.96009015
Median Absolute Deviation (MAD)9.8567
Skewness-0.6846858183
Sum-40488893.32
Variance326.1030135
Histogram with fixed size bins (bins=10)
ValueCountFrequency (%) 
-84.411847771.1%
 
-84.445438150.9%
 
-74.281721470.5%
 
-116.78120290.5%
 
-116.74219850.5%
 
-119.12819140.4%
 
-117.23714360.3%
 
-84.112214060.3%
 
-96.245811550.3%
 
-93.773411490.3%
 
Other values (48375)40456593.1%
 
(Missing)82271.9%
 
ValueCountFrequency (%) 
-177.0121< 0.1%
 
-170.2881< 0.1%
 
-161.8753< 0.1%
 
-160.0971< 0.1%
 
-160.0591< 0.1%
 
ValueCountFrequency (%) 
173.6751< 0.1%
 
139.3881< 0.1%
 
139.3481< 0.1%
 
133.771< 0.1%
 
127.7241< 0.1%
 

descwordcount
Real number (ℝ≥0)

Distinct count3291
Unique (%)0.8%
Missing0
Missing (%)0.0%
Infinite0
Infinite (%)0.0%
Mean372.01362156440905
Minimum1
Maximum11037
Zeros0
Zeros (%)0.0%
Memory size3.3 MiB

Quantile statistics

Minimum1
5-th percentile22
Q168
median228
Q3569
95-th percentile1112
Maximum11037
Range11036
Interquartile range (IQR)501

Descriptive statistics

Standard deviation441.2416145
Coefficient of variation (CV)1.186089941
Kurtosis23.04361022
Mean372.0136216
Median Absolute Deviation (MAD)185
Skewness3.302504381
Sum161678980
Variance194694.1624
Histogram with fixed size bins (bins=10)
ValueCountFrequency (%) 
4322450.5%
 
3222040.5%
 
2821970.5%
 
3621970.5%
 
3421740.5%
 
2721520.5%
 
3321470.5%
 
3721400.5%
 
3021370.5%
 
2421350.5%
 
Other values (3281)41287795.0%
 
ValueCountFrequency (%) 
195< 0.1%
 
2184< 0.1%
 
33810.1%
 
43560.1%
 
54180.1%
 
ValueCountFrequency (%) 
110372< 0.1%
 
861611< 0.1%
 
855111< 0.1%
 
53862< 0.1%
 
53722< 0.1%
 

price_scaled
Real number (ℝ≥0)

HIGH CORRELATION
SKEWED
ZEROS

Distinct count16735
Unique (%)3.9%
Missing0
Missing (%)0.0%
Infinite0
Infinite (%)0.0%
Mean3.708023754199345e-05
Minimum0.0
Maximum0.9999999999999999
Zeros30640
Zeros (%)7.1%
Memory size3.3 MiB

Quantile statistics

Minimum0
5-th percentile0
Q11.343475541e-06
median2.740415924e-06
Q34.931926127e-06
95-th percentile9.483840602e-06
Maximum1
Range1
Interquartile range (IQR)3.588450587e-06

Descriptive statistics

Standard deviation0.004642599256
Coefficient of variation (CV)125.2041401
Kurtosis26745.48215
Mean3.708023754e-05
Median Absolute Deviation (MAD)1.64507209e-06
Skewness158.2481522
Sum16.11525664
Variance2.155372785e-05
Histogram with fixed size bins (bins=10)
ValueCountFrequency (%) 
0306407.1%
 
1.917879879e-0639630.9%
 
2.192058561e-0638770.9%
 
1.233804068e-0638020.9%
 
1.643701197e-0637480.9%
 
9.596253861e-0737160.9%
 
2.466237242e-0636670.8%
 
1.50798275e-0634820.8%
 
1.782161431e-0633940.8%
 
2.740415924e-0633680.8%
 
Other values (16725)37094885.4%
 
ValueCountFrequency (%) 
0306407.1%
 
2.741786817e-1017880.4%
 
5.483573635e-1016< 0.1%
 
8.225360452e-1025< 0.1%
 
1.096714727e-0917< 0.1%
 
ValueCountFrequency (%) 
11< 0.1%
 
0.91392893911< 0.1%
 
0.89617009191< 0.1%
 
0.819668953< 0.1%
 
0.69233995891< 0.1%
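The price_scaled figures above look like price divided by its maximum, i.e. min-max scaling with a minimum of 0. The report does not state how the column was derived, so this is an inference from the reported numbers. A minimal Python sketch that checks the relation against three value pairs taken from the price and price_scaled tables (the function name `min_max_scale` is illustrative):

```python
import math

# Minimum and maximum of `price` as reported in this profile.
PRICE_MIN = 0
PRICE_MAX = 3647256576


def min_max_scale(value, lo=PRICE_MIN, hi=PRICE_MAX):
    """Map value from [lo, hi] onto [0, 1] (assumed derivation of price_scaled)."""
    return (value - lo) / (hi - lo)


# (price, price_scaled) pairs copied from the frequency tables of both variables.
reported_pairs = [
    (6995, 1.917879879e-06),
    (9995, 2.740415924e-06),
    (1, 2.741786817e-10),
]

for price, reported in reported_pairs:
    scaled = min_max_scale(price)
    # The report prints 10 significant digits, so compare with a loose tolerance.
    print(price, scaled, math.isclose(scaled, reported, rel_tol=1e-8))
```

If the relation holds, price_scaled is an exact linear function of price, which would explain the mutual high-correlation warnings and the identical skewness, kurtosis and coefficient of variation of the two variables.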
 

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a linear relationship the steepness of the line does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
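The calculation described above can be sketched in plain Python. This is an illustrative version, not the code the report itself runs; the toy data and the rescaling constants are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariance of xs and ys over the product of their standard deviations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and of the two standard deviations
    # cancel, so plain sums of (cross-)deviations are sufficient.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical toy data: a few prices and a shifted, rescaled copy of them.
prices = [3495.0, 13750.0, 2300.0, 9000.0]
rescaled = [p / 1000.0 + 5.0 for p in prices]

print(pearson_r(prices, rescaled))              # close to +1: location/scale changes leave r unchanged
print(pearson_r(prices, [-p for p in prices]))  # close to -1: flipping the sign flips r
```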

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
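As a sketch of that recipe, the snippet below ranks both variables (tied values share their average rank, a common convention assumed here) and then applies the covariance-over-standard-deviations formula to the ranks. The helper names and the toy data are illustrative.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the rank variables."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    ss_x = sum((a - mean_x) ** 2 for a in rx)
    ss_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical toy data: y = x**3 is monotonic but not linear in x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_rho(x, y))  # the ranks agree perfectly, so rho is 1 up to rounding
```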

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
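The pair-counting definition above corresponds to the variant usually called tau-a. The sketch below implements exactly that definition; tied pairs count as neither concordant nor discordant. Library implementations often use a tie-adjusted variant such as tau-b, so results on data with ties may differ from what this function returns. The function name and data are illustrative.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """(concordant - discordant) / total number of pairs, as described in the text."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both variables move in the same direction.
        direction = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical toy data: one swapped pair among six pairs in total.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
print(kendall_tau_a(xs, ys))  # 5 concordant, 1 discordant -> (5 - 1) / 6
```

The quadratic loop over all pairs is fine for illustration but would be slow on the 434605 observations profiled here.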

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been proved to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
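A minimal sketch of a bias-corrected Cramér's V follows, assuming the commonly cited form of Bergsma's correction: φ² = χ²/n is reduced by (k-1)(r-1)/(n-1) and clipped at 0, and the row and column counts r and k are shrunk by (r-1)²/(n-1) and (k-1)²/(n-1). Whether the report's implementation matches this line for line is not confirmed by the source; the function name and contingency tables are illustrative.

```python
import math


def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for a contingency table given as a list of rows."""
    r, k = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(r)) for j in range(k)]

    # Pearson chi-squared statistic against the independence expectation.
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected

    phi2 = chi2 / n
    # Assumed form of the correction (see lead-in).
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    if denom <= 0:
        return 0.0  # degenerate table with a single row or column
    return math.sqrt(phi2_corr / denom)


# Hypothetical contingency tables (rows and columns are category counts).
independent = [[10, 20], [20, 40]]  # rows are proportional, no association
perfect = [[50, 0], [0, 50]]        # each row category maps to one column category
print(cramers_v_corrected(independent))
print(cramers_v_corrected(perfect))
```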

Missing values

Sample

First rows

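row | df_index | id | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | vin | drive | size | type | paint_color | state | lat | long | descwordcount | price_scaled
0 | 0 | 7119256118 | mohave county | 3495 | 2012.00 | jeep | patriot | like new | 4 cylinders | gas | nan | clean | automatic | NaN | NaN | NaN | NaN | silver | az | 34.46 | -114.27 | 45 | 0.00
1 | 1 | 7120880186 | oregon coast | 13750 | 2014.00 | bmw | 328i m-sport | good | NaN | gas | 76237.00 | clean | automatic | NaN | rwd | NaN | sedan | grey | or | 46.18 | -123.82 | 148 | 0.00
2 | 2 | 7115048251 | greenville / upstate | 2300 | 2001.00 | dodge | caravan | excellent | 6 cylinders | gas | 199000.00 | clean | automatic | NaN | NaN | NaN | NaN | NaN | sc | 34.94 | -81.97 | 46 | 0.00
3 | 3 | 7119250502 | mohave county | 9000 | 2004.00 | chevrolet | colorado ls | excellent | 5 cylinders | gas | 54000.00 | clean | automatic | 1GCCS196448191644 | rwd | mid-size | pickup | red | az | 34.48 | -114.27 | 65 | 0.00
4 | 4 | 7120433904 | maine | 0 | 2021.00 | NaN | Honda-Nissan-Kia-Ford-Hyundai-VW | NaN | NaN | other | nan | clean | other | NaN | NaN | NaN | NaN | NaN | me | 44.47 | -68.90 | 224 | 0.00
5 | 5 | 7120432569 | maine | 500 | 2010.00 | NaN | $500 DOWN PROGRAMS!!! | NaN | NaN | gas | nan | clean | automatic | NaN | NaN | NaN | NaN | NaN | me | 42.84 | -71.11 | 162 | 0.00
6 | 6 | 7120431378 | maine | 0 | 2014.00 | ford | f-150 | excellent | 8 cylinders | gas | 0.00 | clean | automatic | S7002 | 4wd | full-size | pickup | NaN | me | 42.77 | -71.24 | 1016 | 0.00
7 | 7 | 7120430837 | maine | 8500 | 2005.00 | ford | mustang convertible | excellent | 6 cylinders | gas | 62800.00 | clean | automatic | 1ZVHT84N355252184 | rwd | mid-size | convertible | silver | me | 44.21 | -69.79 | 113 | 0.00
8 | 8 | 7120857037 | oregon coast | 0 | 2012.00 | ram | 3500 | NaN | 6 cylinders | diesel | 116515.00 | clean | automatic | 3C63D3KL1CG155836 | 4wd | NaN | truck | NaN | or | 45.41 | -122.62 | 1049 | 0.00
9 | 9 | 7120844862 | oregon coast | 5950 | 2004.00 | honda | odyssey ex-l, reliable, e | NaN | 6 cylinders | gas | 102415.00 | rebuilt | automatic | 5FNRL18924B012679 | fwd | NaN | van | NaN | or | 45.58 | -122.68 | 733 | 0.00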

Last rows

row | df_index | id | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | vin | drive | size | type | paint_color | state | lat | long | descwordcount | price_scaled
434595 | 435839 | 7112254206 | rapid city / west SD | 29930 | 2016.00 | ram | 1500 | excellent | 8 cylinders | gas | 30383.00 | clean | automatic | 1C6RR7MT6GS265408 | 4wd | mid-size | truck | blue | sd | 44.08 | -103.23 | 360 | 0.00
434596 | 435840 | 7116829959 | helena | 24900 | 2017.00 | audi | q3 premium plus | excellent | 4 cylinders | gas | 27100.00 | clean | automatic | NaN | 4wd | full-size | SUV | silver | mt | 45.65 | -110.56 | 125 | 0.00
434597 | 435841 | 7109272290 | richmond | 9995 | 2008.00 | buick | enclave | excellent | 6 cylinders | gas | 145975.00 | clean | automatic | 5GAEV23778J148469 | 4wd | NaN | SUV | brown | va | nan | nan | 428 | 0.00
434598 | 435842 | 7119281941 | mohave county | 2495 | 2006.00 | lincoln | town car | NaN | 8 cylinders | gas | 126302.00 | clean | automatic | 1LNHM82V76Y636936 | rwd | NaN | sedan | white | az | 34.46 | -114.29 | 120 | 0.00
434599 | 435843 | 7115048966 | greenville / upstate | 46995 | 2019.00 | ford | f250 diesel powerstroke 4x4 | like new | 8 cylinders | diesel | 55000.00 | clean | automatic | NaN | 4wd | full-size | pickup | white | sc | 34.80 | -82.39 | 83 | 0.00
434600 | 435844 | 7119262300 | mohave county | 2500 | 2005.00 | ford | f150 | fair | NaN | gas | 282866.00 | clean | automatic | NaN | NaN | full-size | truck | white | az | 35.24 | -113.99 | 17 | 0.00
434601 | 435845 | 7112219717 | rapid city / west SD | 2700 | 2002.00 | toyota | camry | good | 6 cylinders | gas | 194000.00 | clean | automatic | NaN | fwd | NaN | NaN | blue | sd | 44.00 | -103.36 | 29 | 0.00
434602 | 435846 | 7120896708 | oregon coast | 2450 | 2001.00 | ford | focus | good | 4 cylinders | gas | 130484.00 | clean | automatic | NaN | rwd | compact | other | black | or | 45.53 | -123.09 | 66 | 0.00
434603 | 435847 | 7120885819 | oregon coast | 8995 | 2013.00 | mazda | mazda3 | NaN | NaN | gas | 93339.00 | clean | automatic | JM1BL1UPXD1758084 | fwd | NaN | sedan | NaN | or | 45.52 | -122.58 | 664 | 0.00
434604 | 435848 | 7112215161 | rapid city / west SD | 6577 | 2010.00 | dodge | grand caravan | NaN | NaN | gas | 148721.00 | clean | automatic | 2D4RN5DX0AR140668 | fwd | NaN | mini-van | blue | sd | 44.08 | -103.19 | 299 | 0.00